Day 18: The urllib Module

2021 iThome Ironman Contest

Video Tutorial

A Liberal Arts Student's Python Web Scraping Journey, Part 18

13th Ironman Contest

水母君

2021-10-02 10:05:28

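Today's video introduces urllib, the module built into Python for downloading web page content.
Some of its concepts are much the same as those of the requests module covered over the previous two days.
At the end, the video briefly introduces robots.txt, the convention that tells crawlers which parts of a site they may request from the web server.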

The code used in the video is listed below.

# Check the data type
import urllib.request

url = "https://new.ntpu.edu.tw/"
htmlfile = urllib.request.urlopen(url)
print(type(htmlfile))  # <class 'http.client.HTTPResponse'>

# Read the response body with read()
import urllib.request

url = "https://new.ntpu.edu.tw/"
htmlfile = urllib.request.urlopen(url)
print(htmlfile.read())  # the body is returned as raw bytes

# Decode the bytes as UTF-8
import urllib.request

url = "https://new.ntpu.edu.tw/"
htmlfile = urllib.request.urlopen(url)
print(htmlfile.read().decode('utf-8'))
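
Hard-coding 'utf-8' works for this page, but not every site uses that encoding. As a minimal sketch, the charset declared in the response's Content-Type header can be used instead. The helper names read_text and fetch_text and the UTF-8 fallback are assumptions for illustration, not code from the video.

```python
import urllib.request


def read_text(response, fallback="utf-8"):
    """Decode a response body using the charset from its Content-Type header.

    read_text is a hypothetical helper; falling back to UTF-8 is an assumption.
    """
    # response.headers is an email.message.Message, which can parse the
    # "charset=..." parameter of the Content-Type header (None if absent).
    charset = response.headers.get_content_charset() or fallback
    # errors="replace" keeps going when a byte sequence cannot be decoded.
    return response.read().decode(charset, errors="replace")


def fetch_text(url):
    """Download url and return its body as text (hypothetical wrapper)."""
    with urllib.request.urlopen(url) as response:
        return read_text(response)
```

With the page from the video, this could be called as fetch_text("https://new.ntpu.edu.tw/").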

# Commonly used attributes of the HTTPResponse object
import urllib.request

url = "https://new.ntpu.edu.tw/"
htmlfile = urllib.request.urlopen(url)
print("Response URL:", htmlfile.geturl())
print("Status:", htmlfile.status)  # prints the integer 200 on success
print("Headers:", htmlfile.getheaders())
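
The status attribute is only reachable when the request succeeds. When the server answers with an error code, urlopen raises urllib.error.HTTPError, and when no usable answer arrives it raises urllib.error.URLError. A rough sketch of handling both follows. The function name describe_fetch and the 10-second timeout are assumptions for illustration.

```python
import urllib.error
import urllib.request


def describe_fetch(url, timeout=10):
    """Return a short string describing how the request for url went.

    describe_fetch is a hypothetical helper; the timeout value is arbitrary.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return f"OK {response.status}"
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 403 or 404.
        # HTTPError is a subclass of URLError, so it must be caught first.
        return f"HTTP error {err.code}"
    except urllib.error.URLError as err:
        # No usable answer, e.g. DNS failure, refused connection,
        # or a URL scheme that urllib does not know how to open.
        return f"Failed to reach server: {err.reason}"
```

For example, print(describe_fetch("https://new.ntpu.edu.tw/")) would report the status code when the download succeeds.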

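# Try another site!
# Note: some servers reject urllib's default User-Agent, so this request
# may raise urllib.error.HTTPError instead of printing the page.
import urllib.request

url = "https://www.gamer.com.tw/"
htmlfile = urllib.request.urlopen(url)
print(htmlfile.read())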
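
If a site turns the request away, one likely reason is the User-Agent that urllib sends by default. The sketch below only inspects headers and sends nothing over the network. The function name default_user_agent and the value "my-crawler/0.1" are made up for illustration.

```python
import urllib.request


def default_user_agent():
    """Return the User-Agent string urllib sends when none is supplied."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) pairs added to every request
    # that does not already define that header.
    return dict(opener.addheaders).get("User-agent")


print(default_user_agent())  # something like Python-urllib/3.x

# A Request can carry its own header, which takes the place of the default.
# "my-crawler/0.1" is a hypothetical value, not one used in the video.
req = urllib.request.Request(
    "https://www.gamer.com.tw/",
    headers={"User-Agent": "my-crawler/0.1"},
)
# urllib stores header names capitalized, so look it up as "User-agent".
print(req.get_header("User-agent"))  # my-crawler/0.1
```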

# Add request headers
import urllib.request

url = "https://www.gamer.com.tw/"
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36'}

# Wrap the URL and headers in a Request object, then open it as before
req = urllib.request.Request(url, headers=headers)
htmlfile = urllib.request.urlopen(req)
print(htmlfile.read().decode('utf-8'))
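
The video closes with robots.txt. As a small sketch of how those rules can be checked from Python, the standard library's urllib.robotparser can parse a robots.txt file and answer whether a given path may be fetched. The robots.txt content and the example.com URLs below are invented for illustration and do not come from any real site.

```python
import urllib.robotparser

# Hypothetical robots.txt content, made up for this example only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(robots_text):
    """Build a RobotFileParser from robots.txt text (hypothetical helper)."""
    parser = urllib.robotparser.RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
# can_fetch(useragent, url) applies the first matching rule for that agent.
print(parser.can_fetch("*", "https://example.com/index.html"))         # True
print(parser.can_fetch("*", "https://example.com/private/data.html"))  # False
```

For a live site, the parser can instead be pointed at the real file with set_url() and loaded with read() before calling can_fetch().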

If anything in the video is unclear or incorrect, please leave a comment to let me know. Thank you for your feedback.